Credit Card Predictions

Antonio Debouse, Blake Freeman, Bodie Franklin, Eric Romero

Business Understanding

Credit card companies are always looking for better ways to monitor borrowers and determine whether a borrower will default on credit card payments or pay in full. Defaulted payments are often difficult to recoup and create losses for these companies. Defaulting on a payment is defined as not meeting the debt obligation, which here is the credit card payment. Our dataset is composed of 24 attributes and 30,000 records that reflect Taiwanese credit card borrowers' payment histories over a six-month period. The data was pulled from the UCI Machine Learning Repository. The purpose of the dataset is to provide attributes at different points in a borrower's payment history to identify whether the borrower will default or pay in full. Since the dataset captures six payment periods, it gives the credit card firm a chance to identify whether default will occur in various billing cycles. An effective classification algorithm is one that produces strong accuracy, sensitivity, and specificity scores under cross-validation. If an effective classification model can be built, the credit card company will be able to proactively monitor borrowers at various credit stages. Identifying likely defaults early allows the company to minimize its losses: it can reduce the borrower's credit limit or preemptively work with the borrower to create a new repayment plan. Both actions help the company reduce the losses that would occur if no action were taken.
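As a reminder of what the three evaluation scores measure, the sketch below computes them from confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration; they are not results from this project.

```python
def classification_scores(tp, fp, tn, fn):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts.

    tp: defaulters correctly flagged      fn: defaulters missed
    tn: non-defaulters correctly cleared  fp: non-defaulters wrongly flagged
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of actual defaulters that were caught
        "specificity": tn / (tn + fp),  # share of non-defaulters that were cleared
    }


# Made-up counts for illustration only.
scores = classification_scores(tp=40, fp=10, tn=130, fn=20)
print(scores)
```

Because only about one account in five defaults, accuracy alone can look strong for a model that rarely predicts default, which is why sensitivity and specificity are reported alongside it.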

Data Meaning Type

Amount of the given credit (NT dollar): numeric scale. Combined total of credit (amount of money) given to the individual borrower and their family.

Gender: Categorical variable. 1 represents male and 2 represents female.

Education: ordinal scale. 1 represents the highest level of education and 4 the lowest. 1 = graduate school, 2 = university, 3 = high school, and 4 = others. Values 0, 5, and 6 are undefined.

Marital status: Categorical variable. 1 = married, 2 = single, 3 = others. Value 0 is undefined.

Age: numeric scale. Measures how old a borrower is, in years.

PAY_0 to PAY_6: categorical (ordinal) scale. These attributes describe the repayment status for each of the past six months. For example, PAY_0 represents the payment status in September 2005 and PAY_6 represents the payment status in April 2005. -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

BILL_AMT1 to BILL_AMT6: numeric scale. This value represents the amount of the credit card bill in each respective month.

PAY_AMT1 to PAY_AMT6: numeric scale. This value represents the amount paid toward the credit card bill in each respective month.

Default payment next month: categorical scale. 1 represents a default or missed payment. 0 represents payment made.

Data Quality

The data we pulled was fairly clean to start with. However, we did notice some values that fell outside the defined range. This was apparent in the categorical columns Education and Marriage. Education had three additional values (0, 5, and 6), which occurred 345 times out of the 30,000 values in this column. Marriage had a value of 0, which occurred 54 times out of the 30,000 values. We addressed both issues by reassigning these values to the “other” category that is defined for each column. We decided to keep these records rather than drop them, since the values looked like misclassifications rather than bad data.
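A minimal sketch of this recoding in pandas is shown here. It assumes the columns are named EDUCATION and MARRIAGE and that the "others" codes are 4 and 3, as in the attribute descriptions; the rows are toy values, not the real 30,000 records.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "EDUCATION": [1, 2, 0, 5, 6, 3, 4],
    "MARRIAGE": [1, 2, 0, 3, 1, 2, 0],
})

# Count the out-of-range codes before changing anything.
n_bad_education = int(df["EDUCATION"].isin([0, 5, 6]).sum())
n_bad_marriage = int((df["MARRIAGE"] == 0).sum())

# Fold the undefined codes into the existing "others" code of each column.
df["EDUCATION"] = df["EDUCATION"].replace({0: 4, 5: 4, 6: 4})
df["MARRIAGE"] = df["MARRIAGE"].replace({0: 3})

print(n_bad_education, n_bad_marriage)
print(df["EDUCATION"].value_counts().sort_index())
print(df["MARRIAGE"].value_counts().sort_index())
```

Counting the affected rows first makes it easy to confirm that the recode touched exactly the records that were out of range.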

Cleaning the Data

Data Quality Cont.

Of all the accounts in the dataset, 22.12% defaulted (6,636 accounts). Married females with a university education who defaulted account for 3.41% of all accounts, which is the largest share of the defaulted group at 15.42%. Single females with a university education are the second-largest group (2.91% of all accounts), accounting for 13.15% of the defaults, followed by single females with a graduate school education (2.49% of all accounts), who account for 11.26% of total defaults.

Single males with an "other" education who defaulted account for 0.013% of all accounts, which is the smallest share of the defaulted group at about 0.059% (a fraction of 5.877e-4). Second lowest are married females with an "other" education at 0.023% of all accounts, followed by a tie for third lowest at 0.033% between graduate-school-educated males with an "other" marital status and married males with an "other" education.
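The two kinds of percentages above (share of all accounts versus share of defaulted accounts) can be computed with a groupby, as in the sketch below. The column names, including the shortened target name `default`, are assumptions, and the rows are toy values rather than the project data.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "SEX":       [2, 2, 2, 1, 1, 2, 1, 2],
    "EDUCATION": [2, 2, 1, 2, 4, 2, 1, 1],
    "MARRIAGE":  [1, 1, 2, 2, 2, 2, 1, 2],
    "default":   [1, 0, 1, 0, 1, 1, 0, 0],
})

group_cols = ["SEX", "EDUCATION", "MARRIAGE"]
defaulted = df[df["default"] == 1]
counts = defaulted.groupby(group_cols).size()

summary = pd.DataFrame({
    # Defaulted accounts in the group as a share of ALL accounts.
    "pct_of_all": counts / len(df) * 100,
    # The same accounts as a share of DEFAULTED accounts only.
    "pct_of_defaults": counts / len(defaulted) * 100,
}).sort_values("pct_of_defaults", ascending=False)

overall_default_rate = df["default"].mean() * 100
print(summary)
print(overall_default_rate)
```

The two columns are linked by the overall default rate: dividing a group's share of all accounts by the overall rate gives its share of defaults, for example 3.41 / 22.12 ≈ 15.42%.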

Explore Joint Attributes

Using a correlation plot, we first identified attributes that showed strong relationships with default. The correlation was highest for the history of past payments, which records payment delays before default. Payment delay data, specifically at the earlier time frames, showed the highest correlation with default.

Graph showing the correlation matrix. When reviewing the graph, we identified that only PAY_0 through PAY_6 had a high correlation with the variable of interest, default payment next month.
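A correlation ranking like the one read off the plot can be sketched with pandas as follows. The column names (including the shortened target name `default`) are assumptions and the rows are toy values, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "PAY_0":     [-1, 0, 2, 3, -1, 1, 2, 0],
    "PAY_6":     [-1, 0, 0, 2, -1, 0, 2, 0],
    "LIMIT_BAL": [200000, 50000, 30000, 20000, 150000, 80000, 50000, 120000],
    "default":   [0, 0, 1, 1, 0, 0, 1, 0],
})

corr = df.corr()  # Pearson correlation by default

# Correlation of every attribute with the target, strongest first by magnitude.
target_corr = (
    corr["default"]
    .drop("default")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strongly negative relationships near the top instead of burying them at the bottom of the list.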

Visualize Attributes

Explore Attributes and Class

The differences in payment history were best viewed in a violin plot comparison, which clearly indicates an increased likelihood of default as payment delays increase. This makes sense logically, as longer delays in payment will most likely lead to a client defaulting. We also noted that the risk of default increases the later a client begins to delay payments, as the client likely begins to struggle after having made several previous payments. The risk increases especially when delays continue to three months and beyond.
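A numeric companion to the violin plot is the default rate at each delay level. The sketch below assumes the column names PAY_0 and a shortened target name `default`, and uses toy rows only.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "PAY_0":   [-1, -1, 0, 0, 1, 1, 2, 2, 3, 3],
    "default": [0, 0, 0, 1, 0, 1, 1, 1, 1, 1],
})

# Mean of a 0/1 target is the default rate; size shows how many accounts back it.
rate_by_delay = df.groupby("PAY_0")["default"].agg(["mean", "size"])
print(rate_by_delay)
```

Keeping the group size next to the rate helps avoid over-reading delay levels that contain only a handful of accounts.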

There does appear to be heavy skew toward older customers and higher limit balances for both sexes. The median Age, Education, and Marriage status are the same for the total group and for the filtered dataset with just defaulted accounts. When grouping by SEX, the average age for males appears to be higher (see boxplot above) than the average age for females regardless of default status.

However, the median limit balance of NT$140,000 for the full dataset is higher than the NT$90,000 median limit balance for the defaulted group. In contrast to average age, the average limit balance for females is higher than for males regardless of default status.
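These group comparisons can be reproduced with a few pandas aggregations, sketched below. The column names (LIMIT_BAL, AGE, and a shortened target name `default`) are assumptions, and the rows are toy values.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "SEX":       [1, 1, 2, 2, 1, 2],
    "AGE":       [40, 36, 30, 34, 38, 28],
    "LIMIT_BAL": [100000, 60000, 200000, 120000, 80000, 180000],
    "default":   [0, 1, 0, 1, 1, 0],
})

# Median limit balance for everyone versus defaulted accounts only.
median_all = df["LIMIT_BAL"].median()
median_defaulted = df.loc[df["default"] == 1, "LIMIT_BAL"].median()

# Average age and limit balance split by default status and sex.
by_status_and_sex = df.groupby(["default", "SEX"])[["AGE", "LIMIT_BAL"]].mean()

print(median_all, median_defaulted)
print(by_status_and_sex)
```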

New Features

We created new features from the existing categories Gender, Education, and Marital Status. These were combined into the new groupings listed below:

Gender/Education/Marital Status

Gender/Education

Gender/Marital Status

Education/Marital Status

We built these groupings to check whether default is related to combinations of these variables rather than to each one alone. The aim was to give these categorical variables more dimensionality so that they might show a higher correlation with our goal of predicting default payments each month. In addition, based on the correlation matrix, we created a new column that combines the PAY columns, which are strongly correlated with one another. We did the same for the Bill Amount columns. The intent is to obtain a column that is more highly correlated with default payment next month, which is the value being predicted.
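The sketch below shows one way such features could be built in pandas. The new column names are hypothetical, the rows are toy values, and combining the PAY and Bill Amount columns by a row-wise average is an assumption about how the columns were merged.

```python
import pandas as pd

# Toy rows for illustration only; the real data has six PAY and six BILL_AMT columns.
df = pd.DataFrame({
    "SEX": [1, 2], "EDUCATION": [2, 1], "MARRIAGE": [1, 2],
    "PAY_0": [2, -1], "PAY_2": [2, -1], "PAY_3": [0, -1],
    "BILL_AMT1": [3000, 500], "BILL_AMT2": [2000, 700], "BILL_AMT3": [1000, 600],
})

# Combined categorical groupings, built by joining the codes as strings.
cats = df[["SEX", "EDUCATION", "MARRIAGE"]].astype(str)
df["SEX_EDU_MAR"] = cats["SEX"] + "_" + cats["EDUCATION"] + "_" + cats["MARRIAGE"]
df["SEX_EDU"] = cats["SEX"] + "_" + cats["EDUCATION"]
df["SEX_MAR"] = cats["SEX"] + "_" + cats["MARRIAGE"]
df["EDU_MAR"] = cats["EDUCATION"] + "_" + cats["MARRIAGE"]

# Select the payment-status columns (excluding PAY_AMT*) and the bill columns.
pay_cols = [c for c in df.columns if c.startswith("PAY_") and not c.startswith("PAY_AMT")]
bill_cols = [c for c in df.columns if c.startswith("BILL_AMT")]

# Assumed combination rule: row-wise average of each correlated block.
df["PAY_AVG"] = df[pay_cols].mean(axis=1)
df["BILL_AVG"] = df[bill_cols].mean(axis=1)

print(df[["SEX_EDU_MAR", "PAY_AVG", "BILL_AVG"]])
```

String-joined groupings are easy to read in tables and plots; for modeling they would still need to be encoded, for example with one-hot columns.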

Exceptional Work

We utilized feature creation in this project to further explore the data. This ranged from combining categorical variables to taking averages of several columns to create new columns in our data, in an effort to draw better results from the variables that are present. This provided further insight, showing us that the categorical variables Gender, Education, and Marital Status do have a large impact on whether someone will default.

Conclusion

Through our EDA, we were not able to confidently identify strong predictors of default among credit card borrowers. Therefore, we believe further transformations of the dataset are appropriate to identify stronger correlations with our variable of interest. We will explore logarithmic transformations and create additional interaction variables. Ultimately, as the data stands, we believe that a random forest model would be the best predictive model. The main reason is the small number of attributes that correlate with our target variable, default payment next month.
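As a rough sketch of that next step, the example below fits a random forest with scikit-learn and scores it by cross-validated accuracy, sensitivity, and specificity. Everything here is an assumption for illustration: the data is synthetic, the two features (a payment-status code and a log-transformed limit balance) are stand-ins, and the hyperparameters are arbitrary, so the scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data, generated only so the example runs end to end.
rng = np.random.default_rng(0)
n = 400
pay_0 = rng.integers(-1, 4, size=n)                         # payment-status code
log_limit = np.log1p(rng.integers(10_000, 500_000, size=n)) # log-transformed balance
X = np.column_stack([pay_0, log_limit])
p_default = 1.0 / (1.0 + np.exp(-(pay_0 - 1.5)))            # longer delay, higher risk
y = (rng.random(n) < p_default).astype(int)

scoring = {
    "accuracy": "accuracy",
    "sensitivity": make_scorer(recall_score, pos_label=1),  # recall on defaulters
    "specificity": make_scorer(recall_score, pos_label=0),  # recall on non-defaulters
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
results = cross_validate(model, X, y, cv=cv, scoring=scoring)

mean_scores = {name: float(results[f"test_{name}"].mean()) for name in scoring}
print(mean_scores)
```

Stratified folds keep the share of defaulters similar in every fold, which matters when only about a fifth of the accounts default.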